A minimum converted trajectory error (MCTE) approach to high quality speech-to-lips conversion
نویسندگان
چکیده
High quality speech-to-lips conversion, investigated in this work, renders realistic lips movement (video) consistent with input speech (audio) without knowing its linguistic content. Instead of memoryless framebased conversion, we adopt maximum likelihood estimation of the visual parameter trajectories using an audio-visual joint Gaussian Mixture Model (GMM). We propose a minimum converted trajectory error approach (MCTE) to further refine the converted visual parameters. First, we reduce the conversion error by training the joint audio-visual GMM with weighted audio and visual likelihood. Then MCTE uses the generalized probabilistic descent algorithm to minimize a conversion error of the visual parameter trajectories defined on the optimal Gaussian kernel sequence according to the input speech. We demonstrate the effectiveness of the proposed methods using the LIPS 2009 Visual Speech Synthesis Challenge dataset, without knowing the linguistic (phonetic) content of the input speech.
منابع مشابه
Sliding Window-based Speech-to-Lips Conversion with Low Delay
The goal of a good speech-to-lips conversion system is to synthesize high quality, realistic lips movement which is time synchronized with the input speech. Previously, the maximum probability estimation of visual trajectory by Gaussian Mixture Model (GMM) has been successfully proposed and tested for speech-to-lips conversion. It works as a sentence level batch process that convert acoustic sp...
متن کاملSequence error (SE) minimization training of neural network for voice conversion
Neural network (NN) based voice conversion, which employs a nonlinear function to map the features from a source to a target speaker, has been shown to outperform GMM-based voice conversion approach [4-7]. However, there are still limitations to be overcome in NN-based voice conversion, e.g. NN is trained on a Frame Error (FE) minimization criterion and the corresponding weights are adjusted to...
متن کاملAn HMM trajectory tiling (HTT) approach to high quality TTS
We propose an HMM Trajectory Tiling (HTT) approach to high quality TTS, which is our entry to Blizzard Challenge 2010. In HTT, first refined HMM is trained with the Minimum Generation Error (MGE) criterion; then trajectory generated by the refined HMM is to guide the search for finding the closest waveform segment “tiles” in synthesis. Normalized distances between HMM trajectory and those of th...
متن کاملA HMM approach to residual estimation for high resolution voice conversion
Voice conversion systems aim to process speech from a source speaker so it would be perceived as spoken by a target speaker. This paper presents a procedure to improve high resolution voice conversion by modifying the algorithm used for residual estimation. The proposed residual estimation algorithm exploits the temporal dependencies between residuals in consecutive speech frames using a hidden...
متن کاملA Statistical Sample-Based Approach to GMM-Based Voice Conversion Using Tied-Covariance Acoustic Models
This paper presents a novel statistical sample-based approach for Gaussian Mixture Model (GMM)-based Voice Conversion (VC). Although GMM-based VC has the promising flexibility of model adaptation, quality in converted speech is significantly worse than that of natural speech. This paper addresses the problem of inaccurate modeling, which is one of the main reasons causing the quality degradatio...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2010